Abstract

The literature on statistical learning for time series often assumes asymptotic independence
or "mixing" of data sources. Beta-mixing has long been important in establishing the central
limit theorem and invariance principle for stochastic processes; recent work has identified it as
crucial to extending results from empirical processes and statistical learning theory to dependent
data, with quantitative risk bounds involving the actual beta coefficients. There is, however,
presently no way to actually estimate those coefficients from data; while general functional forms
are known for some common classes of processes (Markov processes, ARMA models, etc.), specific
coefficients are generally beyond calculation. We present an $\ell^1$-risk consistent estimator for
the beta-mixing coefficients, based on a single stationary sample path. Since mixing coefficients
involve infinite-order dependence, we use an order-$d$ Markov approximation. We prove
high-probability concentration results for the Markov approximation and show that as $d \to \infty$, the
Markov approximation converges to the true mixing coefficient. Our estimator is constructed
using $d$-dimensional histogram density estimates. Allowing asymptotics in the bandwidth as
well as the dimension, we prove $L^1$ concentration for the histogram as an intermediate step.

Keywords: density estimation, histograms, mixing.

1 Introduction

For time series analysis, the independence assumption is replaced by requiring the asymptotic
independence of events far apart in time, or "mixing". Mixing conditions make the dependence of
the future on the past explicit, quantifying the decay in dependence as the future moves farther
from the past. There are many definitions of mixing of varying strength with matching dependence
coefficients (see [12, 10, 6] for reviews), but many of the results in the statistical literature focus on
$\beta$-mixing or absolute regularity. Roughly speaking (see Definition 2.1 below for a precise statement),
the $\beta$-mixing coefficient at lag $a$ is the total variation distance between the actual joint distribution
of events separated by $a - 1$ time steps and the product of their marginal distributions, i.e., the $L^1$
distance from independence.

Numerous results in the statistics literature rely on knowledge of mixing coefficients. While
much of the theoretical groundwork for the analysis of mixing processes was laid years ago (cf. [35,
5, 13, 27, 1, 31, 37, 38]), recent work has continued to use mixing to prove interesting results
about the analysis of time-series data. Non-parametric inference under mixing conditions is treated
extensively in Bosq [4]. Baraud et al. [2] study the finite sample risk performance of penalized least
squares regression estimators under $\beta$-mixing. Kontorovich and Ramanan [18] prove concentration
of measure results based on a notion of mixing defined therein which is related to the more common
$\phi$-mixing coefficients. Ould-Saïd et al. [26] investigate kernel conditional quantile estimation under
$\alpha$-mixing. Steinwart and Anghel [30] show that support vector machines are consistent for time
series forecasting under a weak dependence condition implied by $\alpha$-mixing. Asymptotic properties
of nonparametric inference for time series under various mixing conditions are described in Liu and
Wu [20]. Finally, Lerasle [19] proposes a block-resampling penalty for density estimation. He shows
that the selected estimator satisfies oracle inequalities under both $\beta$- and $\tau$-mixing.

Many common time series models are known to be $\beta$-mixing, and the rates of decay are known
up to constant factors given the underlying parameters of the process, say $\theta$. Among the processes
for which such knowledge is available are ARMA models [23], GARCH models [7], and certain
Markov processes; see [12] for an overview of such results. Fryzlewicz and Subba Rao [16] derive
upper bounds for the $\alpha$- and $\beta$-mixing rates of non-stationary ARCH processes.

With a few exceptions, the mapping from $\theta$ to the constants in the mixing coefficients is unknown
in the literature. For example, it is known that the mixing coefficients of the ARMA process at
time lag $a$ are $O(\rho^a)$ for some $0 < \rho < 1$. However, both $\rho$ and the constants are unrecoverable
given knowledge of $\theta$. To our knowledge, only Nobel [24] approaches a solution to the problem
of estimating mixing rates by giving a method to distinguish between different polynomial mixing
rate regimes through hypothesis testing.

We present the first method for estimating the $\beta$-mixing coefficients for stationary time series
data given a single sample path. Our methodology can be applied to real data assumed to be
generated from some unknown $\beta$-mixing process. Additionally, it can be used to examine known
mixing processes, thereby determining exact mixing rates via simulation. Section 2 defines the
$\beta$-mixing coefficient and states our main results on convergence rates and consistency for our
estimator. Section 3 gives an intermediate result on the $L^1$ convergence of the histogram estimator
with $\beta$-mixing inputs which is asymptotic in the dimension of the target distribution in addition to
the bandwidth. Section 4 proves the main results from §2. Section 5 demonstrates the performance
of our estimator in three simulated examples, both providing good recovery of known rates in simple
settings and providing insight into unknown mixing regimes. Section 6 concludes and lays
out some avenues for future research.

2 Estimator and consistency results 

In this section, we present one of many equivalent definitions of absolute regularity and state our 
main results, deferring proof to §4. 

To fix notation, let $\mathbf{X} = \{X_t\}_{t=-\infty}^{\infty}$ be a sequence of random variables where each $X_t$ is a
measurable function from a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ into a measurable space $\mathcal{X}$. A block of this
random sequence will be given by $\mathbf{X}_i^j = \{X_t\}_{t=i}^{j}$ where $i$ and $j$ are integers, and may be infinite.
We use similar notation for the sigma fields generated by these blocks and their joint distributions.
In particular, $\sigma_i^j$ will denote the sigma field generated by $\mathbf{X}_i^j$, and the joint distribution of $\mathbf{X}_i^j$ will
be denoted $\mathbb{P}_i^j$.



2.1 Definitions 

There are many equivalent definitions of $\beta$-mixing (see for instance [12], or [6] as well as Meir [22]
or Yu [38]); however, the most intuitive is that given in Doukhan [12].

Definition 2.1 ($\beta$-mixing). For each $a \in \mathbb{N}$ and any $t \in \mathbb{Z}$, the coefficient of absolute regularity,
or $\beta$-mixing coefficient, $\beta(a)$, is

$$\beta(a) \equiv \left\| \mathbb{P}_{-\infty}^{t} \otimes \mathbb{P}_{t+a}^{\infty} - \mathbb{P}_{t,a} \right\|_{TV} \tag{1}$$

where $\|\cdot\|_{TV}$ is the total variation norm, and $\mathbb{P}_{t,a}$ is the joint distribution of $(\mathbf{X}_{-\infty}^{t}, \mathbf{X}_{t+a}^{\infty})$. A
stochastic process is said to be absolutely regular, or $\beta$-mixing, if $\beta(a) \to 0$ as $a \to \infty$.

Loosely speaking, Definition 2.1 says that the coefficient $\beta(a)$ measures the total variation
distance between the joint distribution of random variables separated by $a - 1$ time units and a
distribution under which random variables separated by $a - 1$ time units are independent. In the
most general setting, $\|\mathbb{P}_{-\infty}^{t} \otimes \mathbb{P}_{t+a}^{\infty} - \mathbb{P}_{t,a}\|_{TV}$ is preceded by a supremum over $t$. However, this
additional generality is unnecessary for stationary random processes $\mathbf{X}$, which is the only case we
consider here.

Definition 2.2 (Stationarity). A sequence of random variables $\mathbf{X}$ is stationary when all its
finite-dimensional distributions are invariant over time: for all $t$ and all non-negative integers $i$ and $j$,
the random vectors $\mathbf{X}_t^{t+i}$ and $\mathbf{X}_{t+j}^{t+i+j}$ have the same distribution.

Our main result requires the method of blocking used by Yu [37, 38]. The purpose is to transform
a sequence of dependent variables into subsequences of nearly IID ones. Consider a sample $\mathbf{X}_1^n$ from
a stationary $\beta$-mixing sequence with density $f$. Let $m_n$ and $\mu_n$ be non-negative integers such that
$2m_n\mu_n = n$. Now divide $\mathbf{X}_1^n$ into $2\mu_n$ blocks, each of length $m_n$. Identify the blocks as follows:

$$U_j = \{X_i : 2(j-1)m_n + 1 \le i \le (2j-1)m_n\},$$
$$V_j = \{X_i : (2j-1)m_n + 1 \le i \le 2jm_n\}.$$

Let $\mathbf{U}$ be the entire sequence of odd blocks $U_j$, and let $\mathbf{V}$ be the sequence of even blocks $V_j$. Finally,
let $\mathbf{U}'$ be a sequence of blocks which are independent of $\mathbf{X}_1^n$ but such that each block has the same
distribution as a block from the original sequence. That is, construct $U'_j$ such that

$$\mathcal{L}(U'_j) = \mathcal{L}(U_j) = \mathcal{L}(U_1), \tag{2}$$

where $\mathcal{L}(\cdot)$ means the probability law of the argument. The blocks $\mathbf{U}'$ are now an IID block
sequence, in that for integers $i, j \le 2\mu_n$, $i \ne j$, $U'_i \perp\!\!\!\perp U'_j$, so standard results about IID random
variables can be applied to these blocks. (See [38] for a more rigorous analysis of blocking.) With
this structure, we can state our main result.
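
As a concrete illustration of the blocking construction, the following minimal Python sketch splits a sample path into the odd blocks $U_j$ and the even blocks $V_j$. The function name `split_blocks` and the list-based representation are illustrative assumptions, not part of the method as stated; the independent copies $\mathbf{U}'$ are a proof device and are not constructed here.

```python
def split_blocks(x, m):
    """Split a path x of length n = 2 * mu * m into odd blocks U and even blocks V.

    Hypothetical helper: block j (1-indexed) follows the index sets in the text,
    U_j = {X_i : 2(j-1)m + 1 <= i <= (2j-1)m} and
    V_j = {X_i : (2j-1)m + 1 <= i <= 2jm}.
    """
    n = len(x)
    if m <= 0 or n % (2 * m) != 0:
        raise ValueError("need a positive block length m with 2 * mu * m = n")
    mu = n // (2 * m)
    # Python slices are 0-indexed and half-open, hence the shifted endpoints.
    U = [x[2 * (j - 1) * m:(2 * j - 1) * m] for j in range(1, mu + 1)]
    V = [x[(2 * j - 1) * m:2 * j * m] for j in range(1, mu + 1)]
    return U, V


# Usage: n = 12 observations, block length m = 2, so mu = 3 blocks of each kind.
U, V = split_blocks(list(range(1, 13)), m=2)
```

Consecutive odd blocks are separated by a full even block of $m_n$ observations, which is why the dependence between them can be controlled through $\beta(m_n)$.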

2.2 Results 

Our result emerges in two stages. First, we recognize that the distribution of a finite sample depends
only on finite-dimensional distributions. This leads to an estimator of a finite-dimensional version
of $\beta(a)$. Next, we let the finite dimension increase to infinity with the size of the observed sample.
For positive integers $t$, $d$, and $a$, define

$$\beta^d(a) \equiv \left\| \mathbb{P}_{t-d+1}^{t} \otimes \mathbb{P}_{t+a}^{t+a+d-1} - \mathbb{P}_{t,a,d} \right\|_{TV} \tag{3}$$

where $\mathbb{P}_{t,a,d}$ is the joint distribution of $(\mathbf{X}_{t-d+1}^{t}, \mathbf{X}_{t+a}^{t+a+d-1})$. Also, let $\hat f^d$ be the $d$-dimensional
histogram estimator of the joint density of $d$ consecutive observations, and let $\hat f_a^{2d}$ be the
$2d$-dimensional histogram estimator of the joint density of two sets of $d$ consecutive observations
separated by $a - 1$ time points.

We construct an estimator of $\beta^d(a)$ based on these two histograms.¹ Define

$$\hat\beta^d(a) \equiv \frac{1}{2}\int \left| \hat f_a^{2d} - \hat f^d \otimes \hat f^d \right| \tag{4}$$

We show that, by allowing $d = d_n$ to grow with $n$, this estimator will converge to $\beta(a)$. This can be
seen most clearly by bounding the $\ell^1$-risk of the estimator with its estimation and approximation
errors:

$$|\hat\beta^{d_n}(a) - \beta(a)| \le |\hat\beta^{d_n}(a) - \beta^{d_n}(a)| + |\beta^{d_n}(a) - \beta(a)|.$$

The first term is the error of estimating $\beta^d(a)$ with a random sample of data. The second term is
the non-stochastic error induced by approximating the infinite dimensional coefficient, $\beta(a)$, with
its $d$-dimensional counterpart, $\beta^d(a)$.
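
The estimator in (4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `bin_labels`, `beta_hat`, and `simulate_two_state_chain` are hypothetical, the marginal histogram is built from all length-$d$ windows of the path, and both histograms share one equal-width grid so that the integral collapses to a sum over grid cells.

```python
import math
import random
from collections import Counter


def bin_labels(x, h):
    """Map real-valued observations to histogram bin indices of width h."""
    return [math.floor(v / h) for v in x]


def beta_hat(labels, a, d=1):
    """Plug-in estimate of beta^d(a) from a single path of bin labels.

    Because both histograms are piecewise constant on the same grid, the
    integral in (4) reduces to a sum over grid cells of
    |joint cell probability - product of marginal cell probabilities|.
    """
    windows = [tuple(labels[i:i + d]) for i in range(len(labels) - d + 1)]
    lag = d - 1 + a  # offset between the start of the past and future blocks
    n_pairs = len(windows) - lag
    if n_pairs <= 0:
        raise ValueError("path too short for this choice of a and d")

    # d-dimensional histogram of d consecutive observations.
    marginal = Counter(windows)
    n_windows = len(windows)
    # 2d-dimensional histogram of (past block, future block) pairs.
    joint = Counter((windows[i], windows[i + lag]) for i in range(n_pairs))

    total = 0.0
    for u, count_u in marginal.items():
        for v, count_v in marginal.items():
            p_joint = joint.get((u, v), 0) / n_pairs
            p_product = (count_u / n_windows) * (count_v / n_windows)
            total += abs(p_joint - p_product)
    return 0.5 * total


def simulate_two_state_chain(n, seed=0):
    """Hypothetical test input: chain on {0, 1} with P(0->0) = P(0->1) = 1/2, P(1->0) = 1."""
    rng = random.Random(seed)
    state, path = 0, []
    for _ in range(n):
        path.append(state)
        state = (1 if rng.random() < 0.5 else 0) if state == 0 else 0
    return path
```

For real-valued data one would first call `bin_labels(x, h)` and pass the result to `beta_hat`. The double loop visits every pair of occupied marginal cells, so the cost grows with the square of the number of occupied cells; that is acceptable for the small $d$ considered here but is a design shortcut rather than a tuned implementation.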

Our first theorem in this section establishes consistency of $\hat\beta^{d_n}(a)$ as an estimator of $\beta(a)$ for all
$\beta$-mixing processes provided $d_n$ increases at an appropriate rate. Theorem 2.4 gives finite sample
bounds on the estimation error while some measure theoretic arguments contained in §4 show that
the approximation error must go to zero as $d_n \to \infty$.

Theorem 2.3. Let $\mathbf{X}_1^n$ be a sample from an arbitrary $\beta$-mixing process. Let $d_n = O(\exp\{W(\log n)\})$
where $W$ is the Lambert W function.² Then $\hat\beta^{d_n}(a) \xrightarrow{P} \beta(a)$ as $n \to \infty$.

The $O(\exp\{W(\log n)\})$ growth rate of $d_n$ in the above theorem leads to the optimal rate of
decay for the estimation error. A finite sample bound for the estimation error is the first step to
establishing consistency for $\hat\beta^{d_n}$. This result gives convergence rates for estimation of the finite
dimensional mixing coefficient $\beta^d(a)$ and also for Markov processes of known order $d$, since in this
case, $\beta^d(a) = \beta(a)$.
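
To get a feel for how slowly $\exp\{W(\log n)\}$ grows, the sketch below evaluates it for a few sample sizes. The Lambert W function is computed with a plain Newton iteration on $w e^w = x$; the helper names `lambert_w` and `markov_order` are illustrative assumptions, and a library routine could be substituted where available.

```python
import math


def lambert_w(x, tol=1e-12):
    """Principal branch of the Lambert W function for x >= 0, via Newton's method."""
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    w = math.log1p(x)  # starts at or above the root, so the iterates decrease to it
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1))
        w -= step
        if abs(step) < tol:
            break
    return w


def markov_order(n):
    """The growth rate exp{W(log n)} for the Markov order d_n."""
    return math.exp(lambert_w(math.log(n)))


# Usage: compare d_n with log log n and log n for several sample sizes.
rates = {n: (math.log(math.log(n)), markov_order(n), math.log(n))
         for n in (100, 10 ** 4, 10 ** 6, 10 ** 9)}
```

In each triple the middle entry sits strictly between the outer two, in line with the ordering between $\log\log n$ and $\log n$.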

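Theorem 2.4. Consider a sample $\mathbf{X}_1^n$ from a stationary $\beta$-mixing process. Let $\mu_n$ and $m_n$ be
positive integers such that $2\mu_n m_n = n$ and $\mu_n \ge d > 0$. Then

$$\mathbb{P}\left(|\hat\beta^d(a) - \beta^d(a)| > \epsilon\right) \le 2\exp\left\{-\frac{\mu_n\epsilon_1^2}{2}\right\} + 2\exp\left\{-\frac{\mu_n\epsilon_2^2}{2}\right\} + 4(\mu_n - 1)\beta(m_n),$$

where $\epsilon_1 = \epsilon/2 - \mathbb{E}\left[\int |\hat f^d - f^d|\right]$ and $\epsilon_2 = \epsilon - \mathbb{E}\left[\int |\hat f_a^{2d} - f_a^{2d}|\right]$.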

Consistency of the estimator $\hat\beta^d(a)$ is guaranteed only for certain choices of $m_n$ and $\mu_n$. Clearly
$\mu_n \to \infty$ and $\mu_n\beta(m_n) \to 0$ as $n \to \infty$ are necessary conditions. Consistency also requires
convergence of the histogram estimators to the target densities. We leave the proof of this theorem
for section 4. As an example to show that this bound can go to zero with proper choices of $m_n$
and $\mu_n$, the following corollary proves consistency for first order Markov processes. Consistency
of the estimator for higher order Markov processes can be proven similarly. These processes are
geometrically $\beta$-mixing as shown in e.g. Nummelin and Tuominen [25].

¹ While it is clearly possible to replace histograms with other choices of density estimators (most notably kernel
density estimators), histograms in this case are more convenient theoretically and computationally. See §6 for more
details.

² The Lambert W function is defined as the (multivalued) inverse of $f(w) = w\exp\{w\}$. Thus, $O(\exp\{W(\log n)\})$
is bigger than $O(\log\log n)$ but smaller than $O(\log n)$. See for example Corless et al. [8].

Corollary 2.5. Let $\mathbf{X}_1^n$ be a sample from a first order Markov process with $\beta(a) = \beta^1(a) = O(\rho^a)$
for some $0 < \rho < 1$. Then under the conditions of Theorem 2.4, $\hat\beta^1(a) \xrightarrow{P} \beta(a)$ at a rate of $o(\sqrt{n})$
up to a logarithmic factor.

Proof. Recall that $n = 2\mu_n m_n$. Then,

$$4(\mu_n - 1)\beta(m_n) = 4\mu_n\beta(m_n) - 4\beta(m_n) = K_1\frac{n}{m_n}\rho^{m_n} + K_2\rho^{m_n} \to 0$$

if $m_n = \Omega(\log n)$ for constants $K_1$ and $K_2$. But the exponential terms are

$$\exp\left\{-K_3\frac{n\epsilon_j^2}{m_n}\right\}$$

for $j = 1, 2$ and a constant $K_3$. Therefore, both exponential terms go to 0 as $n \to \infty$ if $m_n = o(n)$.
Setting the right side to $\delta$ and solving gives that as long as $m_n = \Omega(\log n)$, then $\hat\beta^1(a) \xrightarrow{P} \beta(a)$ at
a rate of $o(\sqrt{n})$ apart from a logarithmic factor. □
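
The trade-off between the block length $m_n$ and the number of blocks $\mu_n$ can be seen by evaluating the right-hand side of the bound in Theorem 2.4 numerically. The sketch below is illustrative only: `estimation_error_bound` and `geometric_beta` are hypothetical names, the mixing profile $\beta(a) = \rho^a$ with $\rho = 0.5$ is an assumed example, and $\epsilon_1$, $\epsilon_2$ are passed in directly rather than derived from the expected $L^1$ errors of the histograms.

```python
import math


def estimation_error_bound(n, m, eps1, eps2, beta):
    """Right-hand side of the Theorem 2.4 bound for block length m (mu = n / (2m)).

    eps1 and eps2 are supplied by the caller; in the theorem they equal
    eps/2 and eps minus the expected L1 errors of the two histograms,
    which this sketch does not attempt to compute.
    """
    if n % (2 * m) != 0:
        raise ValueError("need 2 * mu * m = n")
    mu = n // (2 * m)
    concentration = 2 * math.exp(-mu * eps1 ** 2 / 2) + 2 * math.exp(-mu * eps2 ** 2 / 2)
    dependence = 4 * (mu - 1) * beta(m)
    return concentration + dependence


def geometric_beta(rho, constant=1.0):
    """Hypothetical geometric mixing profile beta(a) = constant * rho**a."""
    return lambda a: constant * rho ** a


# Usage: longer blocks shrink the dependence term but leave fewer blocks (smaller mu).
beta = geometric_beta(0.5)
bounds = {m: estimation_error_bound(10 ** 6, m, 0.05, 0.1, beta) for m in (5, 50, 500)}
```

With these assumed inputs, very short blocks make the dependence term dominate (the bound exceeds 1 and is vacuous), while very long blocks leave too few blocks for the exponential terms to be small; an intermediate block length gives the smallest value of the three.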

Proving Theorem 2.4 requires showing the $L^1$ convergence of the histogram density estimator
with $\beta$-mixing data. We do this in the next section.



3 $L^1$ convergence of histograms

Convergence of density estimators is thoroughly studied in the statistics and machine learning
literature. Early papers on the $L^\infty$ convergence of kernel density estimators (KDEs) include [36, 3,
29]; Freedman and Diaconis [15] look specifically at histogram estimators, and Yu [37] considers
the $L^\infty$ convergence of KDEs for $\beta$-mixing data and shows that the optimal IID rates can be
attained. Tran [32] proves $L^2$ convergence for histograms under $\alpha$- and $\beta$-mixing. Devroye and
Györfi [11] argue that $L^1$ is a more appropriate metric for studying density estimation, and Tran
[31] proves $L^1$ consistency of KDEs under $\alpha$- and $\beta$-mixing. As far as we are aware, ours is the first
proof of $L^1$ convergence for histograms under $\beta$-mixing.

Additionally, the dimensionality of the target density is analogous to the order of the Markov
approximation. Therefore, the convergence rates we give are asymptotic in the bandwidth $h_n$,
which shrinks as $n$ increases, but also in the dimension $d$, which increases with $n$. Even under these
asymptotics, histogram estimation in this sense is not a high dimensional problem. The dimension
of the target density considered here is on the order of $\exp\{W(\log n)\}$, a rate somewhere between
$\log n$ and $\log\log n$.

Theorem 3.1. If $\hat f$ is the histogram estimator based on a (possibly vector valued) sample $\mathbf{X}_1^n$ from
a $\beta$-mixing sequence with stationary density $f$, then for all $\epsilon > \mathbb{E}\left[\int |\hat f - f|\right]$,

$$\mathbb{P}\left(\int |\hat f - f| > \epsilon\right) \le 2\exp\left\{-\frac{\mu_n\epsilon_1^2}{2}\right\} + 2(\mu_n - 1)\beta(m_n) \tag{5}$$

where $\epsilon_1 = \epsilon - \mathbb{E}\left[\int |\hat f - f|\right]$.



To prove this result, we use the blocking method of [38] to transform the dependent $\beta$-mixing
sequence into a sequence of nearly independent blocks. We then apply McDiarmid's inequality to
the blocks to derive asymptotics in the bandwidth of the histogram as well as the dimension of the
target density. For completeness, we state Yu's blocking result and McDiarmid's inequality before
proving the doubly asymptotic histogram convergence for IID data. Combining these lemmas allows
us to derive rates of convergence for histograms based on $\beta$-mixing inputs.

Lemma 3.2 (Lemma 4.1 in [38]). Let $\phi$ be a measurable function with respect to the block sequence
$\mathbf{U}$ uniformly bounded by $M$. Then,

$$\left|\mathbb{E}[\phi] - \tilde{\mathbb{E}}[\phi]\right| \le (\mu_n - 1)M\beta(m_n), \tag{6}$$

where the first expectation is with respect to the dependent block sequence, $\mathbf{U}$, and $\tilde{\mathbb{E}}$ is with respect
to the independent sequence, $\mathbf{U}'$.

This lemma essentially gives a method of applying IID results to $\beta$-mixing data. Because the
dependence decays as we increase the separation between blocks, widely spaced blocks are nearly
independent of each other. In particular, the difference between expectations over these nearly
independent blocks and expectations over blocks which are actually independent can be controlled
by the $\beta$-mixing coefficient.

Lemma 3.3 (McDiarmid Inequality [21]). Let $X_1, \ldots, X_n$ be independent random variables, with
$X_i$ taking values in a set $A_i$ for each $i$. Suppose that the measurable function $f : \prod A_i \to \mathbb{R}$ satisfies

$$|f(\mathbf{x}) - f(\mathbf{x}')| \le c_i$$

whenever the vectors $\mathbf{x}$ and $\mathbf{x}'$ differ only in the $i$th coordinate. Then for any $\epsilon > 0$,

$$\mathbb{P}(f - \mathbb{E}f > \epsilon) \le \exp\left\{-\frac{2\epsilon^2}{\sum c_i^2}\right\}.$$

The following lemma provides the doubly asymptotic convergence of the histogram estimator
for IID data. It differs from standard histogram convergence results in the bias calculation. In this
case we need to be more careful about the interaction between $d$ and $h_n$.

Lemma 3.4. For an IID sample $X_1, \ldots, X_n$ from some density $f$ on $\mathbb{R}^d$,

$$\mathbb{E}\int |\hat f - \mathbb{E}\hat f|\,dx = O\left(1/\sqrt{nh^d}\right) \tag{7}$$

$$\int |\mathbb{E}\hat f - f|\,dx = O(dh) + O(d^2h^2) \tag{8}$$

where $\hat f$ is the histogram estimate using a grid with sides of length $h$.
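
Proof of Lemma 3.4. Let $p_j$ be the probability of falling into the $j$th bin $B_j$. Then,

$$\begin{aligned}
\mathbb{E}\int |\hat f - \mathbb{E}\hat f|\,dx &= h^d\sum_{j=1}^{J}\mathbb{E}\left|\frac{1}{nh^d}\sum_{i=1}^{n} I(X_i \in B_j) - \frac{p_j}{h^d}\right| \\
&\le h^d\sum_{j=1}^{J}\frac{1}{nh^d}\sqrt{\mathbb{V}\left[\sum_{i=1}^{n} I(X_i \in B_j)\right]} \\
&= \frac{1}{n}\sum_{j=1}^{J}\sqrt{np_j(1 - p_j)} \\
&= \frac{1}{\sqrt{n}}\sum_{j=1}^{J}\sqrt{p_j(1 - p_j)} \\
&= O(n^{-1/2})\,O(h^{-d/2}) = O\left(1/\sqrt{nh^d}\right).
\end{aligned}$$

The last line uses $\sum_{j}\sqrt{p_j} \le \sqrt{J} = h^{-d/2}$, which follows from the Cauchy-Schwarz inequality.

For the second claim, consider the bin $B_j$ centered at $\mathbf{c}$. Let $I$ be the union of all bins $B_j$. Assume
the following regularity conditions as in [14]:

1. $f \in L^2$ and $f$ is absolutely continuous on $I$, with a.e. partial derivatives $f_i = \frac{\partial}{\partial y_i}f(\mathbf{y})$

2. $f_i \in L^2$ and $f_i$ is absolutely continuous on $I$, with a.e. partial derivatives $f_{ik} = \frac{\partial}{\partial y_k}f_i(\mathbf{y})$

3. $f_{ik} \in L^2$ for all $i, k$.

Using a Taylor expansion

$$f(\mathbf{x}) = f(\mathbf{c}) + \sum_{i=1}^{d}(x_i - c_i)f_i(\mathbf{c}) + O(d^2h_n^2),$$

where $f_i(\mathbf{y}) = \frac{\partial}{\partial y_i}f(\mathbf{y})$. Therefore, $p_j$ is given by

$$p_j = \int_{B_j} f(\mathbf{x})\,d\mathbf{x} = h_n^d f(\mathbf{c}) + O(d^2h_n^{d+2})$$

since the integral of the second term over the bin is zero. This means that for the $j$th bin,

$$\mathbb{E}\hat f_n(\mathbf{x}) - f(\mathbf{x}) = \frac{p_j}{h_n^d} - f(\mathbf{x}) = -\sum_{i=1}^{d}(x_i - c_i)f_i(\mathbf{c}) + O(d^2h_n^2).$$

Therefore,

$$\begin{aligned}
\int_{B_j}\left|\mathbb{E}\hat f_n(\mathbf{x}) - f(\mathbf{x})\right|d\mathbf{x} &= \int_{B_j}\left|-\sum_{i=1}^{d}(x_i - c_i)f_i(\mathbf{c}) + O(d^2h_n^2)\right|d\mathbf{x} \\
&\le \int_{B_j}\left|\sum_{i=1}^{d}(x_i - c_i)f_i(\mathbf{c})\right|d\mathbf{x} + \int_{B_j}O(d^2h_n^2)\,d\mathbf{x} \\
&= \int_{B_j}\left|\sum_{i=1}^{d}(x_i - c_i)f_i(\mathbf{c})\right|d\mathbf{x} + O(d^2h_n^{2+d}) \\
&= O(dh_n^{d+1}) + O(d^2h_n^{2+d}).
\end{aligned}$$

Since each bin is bounded, we can sum over all $J$ bins. The number of bins is $J = h_n^{-d}$ by definition,
so

$$\int\left|\mathbb{E}\hat f_n(\mathbf{x}) - f(\mathbf{x})\right|d\mathbf{x} = O(h_n^{-d})\left(O(dh_n^{d+1}) + O(d^2h_n^{2+d})\right) = O(dh_n) + O(d^2h_n^2).$$

□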
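
The bias rate in the second claim of Lemma 3.4 can be checked numerically in the simplest case. The sketch below is a one-dimensional illustration ($d = 1$) under assumptions of our own choosing: the density $f(x) = 2x$ on $[0, 1]$, equal-width bins, and a midpoint rule for the integrals. The helper name `l1_bias` is hypothetical.

```python
def l1_bias(num_bins, density, sub=200):
    """Approximate the L1 bias  int |E f_hat - f|  of a histogram on [0, 1].

    Illustrative one-dimensional check (d = 1) with equal-width bins of
    width h = 1 / num_bins; integrals use a midpoint rule with `sub`
    points per bin.
    """
    h = 1.0 / num_bins
    step = h / sub
    total = 0.0
    for j in range(num_bins):
        xs = [j * h + (k + 0.5) * step for k in range(sub)]
        fs = [density(x) for x in xs]
        # E f_hat is constant on the bin: p_j / h, the average of f over the bin.
        mean_height = sum(fs) * step / h
        total += sum(abs(mean_height - f) for f in fs) * step
    return total


# Usage: for f(x) = 2x the exact bias is h / 2, so halving h halves the bias.
linear = lambda x: 2.0 * x
biases = [l1_bias(J, linear) for J in (10, 20, 40)]
```

For this density the bias is linear in $h$, which matches the leading $O(dh)$ term with $d = 1$; the $O(d^2h^2)$ term vanishes because the second derivative is zero.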



We can now prove the main result of this section. 



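Proof of Theorem 3.1. Let $g$ be the $L^1$ loss of the histogram estimator, $g = \int |f - \hat f_n|$. Here
$\hat f_n(x) = \frac{1}{nh_n^d}\sum_{i=1}^{n} I(X_i \in B_j(x))$ where $B_j(x)$ is the bin containing $x$. Let $\hat f_{\mathbf{U}}$, $\hat f_{\mathbf{V}}$, and $\hat f_{\mathbf{U}'}$ be
histograms based on the block sequences $\mathbf{U}$, $\mathbf{V}$, and $\mathbf{U}'$ respectively. Clearly $\hat f_n = \frac{1}{2}(\hat f_{\mathbf{U}} + \hat f_{\mathbf{V}})$.
Now,

$$\begin{aligned}
\mathbb{P}(g > \epsilon) &= \mathbb{P}\left(\int |f - \hat f_n| > \epsilon\right) \\
&= \mathbb{P}\left(\int\left|\frac{f - \hat f_{\mathbf{U}}}{2} + \frac{f - \hat f_{\mathbf{V}}}{2}\right| > \epsilon\right) \\
&\le \mathbb{P}\left(\frac{1}{2}\int |f - \hat f_{\mathbf{U}}| + \frac{1}{2}\int |f - \hat f_{\mathbf{V}}| > \epsilon\right) \\
&= \mathbb{P}(g_{\mathbf{U}} + g_{\mathbf{V}} > 2\epsilon) \\
&\le \mathbb{P}(g_{\mathbf{U}} > \epsilon) + \mathbb{P}(g_{\mathbf{V}} > \epsilon) \\
&= 2\mathbb{P}(g_{\mathbf{U}} - \mathbb{E}[g_{\mathbf{U}}] > \epsilon - \mathbb{E}[g_{\mathbf{U}}]) \\
&= 2\mathbb{P}(g_{\mathbf{U}} - \mathbb{E}[g_{\mathbf{U}'}] > \epsilon - \mathbb{E}[g_{\mathbf{U}'}]) \\
&= 2\mathbb{P}(g_{\mathbf{U}} - \mathbb{E}[g_{\mathbf{U}'}] > \epsilon_1),
\end{aligned}$$

where $\epsilon_1 = \epsilon - \mathbb{E}[g_{\mathbf{U}'}]$. Here,

$$\mathbb{E}[g_{\mathbf{U}'}] \le \mathbb{E}\int |\hat f_{\mathbf{U}'} - \mathbb{E}\hat f_{\mathbf{U}'}|\,dx + \int |\mathbb{E}\hat f_{\mathbf{U}'} - f|\,dx,$$

so by Lemma 3.4, as long as $\mu_n \to \infty$, $h_n \downarrow 0$ and $\mu_n h_n^d \to \infty$, then for all $\epsilon$ there exists $n_0(\epsilon)$
such that for all $n > n_0(\epsilon)$, $\epsilon > \mathbb{E}[g] = \mathbb{E}[g_{\mathbf{U}'}]$. Now applying Lemma 3.2 to the expectation of the
indicator of the event $\{g_{\mathbf{U}} - \mathbb{E}[g_{\mathbf{U}'}] > \epsilon_1\}$ gives

$$2\mathbb{P}(g_{\mathbf{U}} - \mathbb{E}[g_{\mathbf{U}'}] > \epsilon_1) \le 2\tilde{\mathbb{P}}(g_{\mathbf{U}'} - \mathbb{E}[g_{\mathbf{U}'}] > \epsilon_1) + 2(\mu_n - 1)\beta(m_n)$$

where the probability on the right is for the $\sigma$-field generated by the independent block sequence $\mathbf{U}'$.
Since these blocks are independent, showing that $g_{\mathbf{U}'}$ satisfies the bounded differences requirement
allows for the application of McDiarmid's inequality (Lemma 3.3) to the blocks. For any two block sequences
$u'_1, \ldots, u'_{\mu_n}$ and $\bar u'_1, \ldots, \bar u'_{\mu_n}$ with $u'_\ell = \bar u'_\ell$ for all $\ell \ne j$,

$$\begin{aligned}
\left|g_{\mathbf{U}'}(u'_1, \ldots, u'_{\mu_n}) - g_{\mathbf{U}'}(\bar u'_1, \ldots, \bar u'_{\mu_n})\right| &= \left|\int |\hat f(y; u'_1, \ldots, u'_{\mu_n}) - f(y)|\,dy - \int |\hat f(y; \bar u'_1, \ldots, \bar u'_{\mu_n}) - f(y)|\,dy\right| \\
&\le \int |\hat f(y; u'_1, \ldots, u'_{\mu_n}) - \hat f(y; \bar u'_1, \ldots, \bar u'_{\mu_n})|\,dy \\
&\le \frac{2}{\mu_n},
\end{aligned}$$

since the two histograms differ only through the $m_n$ observations in block $j$, each of which carries
mass $1/(\mu_n m_n)$. Applying Lemma 3.3 with $c_i = 2/\mu_n$ for each of the $\mu_n$ blocks gives $\sum c_i^2 = 4/\mu_n$.
Therefore,

$$\begin{aligned}
\mathbb{P}(g > \epsilon) &\le 2\tilde{\mathbb{P}}(g_{\mathbf{U}'} - \mathbb{E}[g_{\mathbf{U}'}] > \epsilon_1) + 2(\mu_n - 1)\beta(m_n) \\
&\le 2\exp\left\{-\frac{\mu_n\epsilon_1^2}{2}\right\} + 2(\mu_n - 1)\beta(m_n).
\end{aligned}$$

□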



4 Proofs of results in §2.2 



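The proof of Theorem 2.4 relies on the triangle inequality and the relationship between total
variation distance and the $L^1$ distance between densities.

Proof of Theorem 2.4. For any probability measures $\nu$ and $\lambda$ defined on the same probability space
with associated densities $f_\nu$ and $f_\lambda$ with respect to some dominating measure $\pi$,

$$\|\nu - \lambda\|_{TV} = \frac{1}{2}\int |f_\nu - f_\lambda|\,d\pi.$$

Let $P$ be the $d$-dimensional stationary distribution of the $d$th order Markov process, i.e. $P =
\mathbb{P}_{t-d+1}^{t} = \mathbb{P}_{t+a}^{t+a+d-1}$ in the notation of equation 3. Let $P_{a,d}$ be the joint distribution of the bivariate
random process created by the initial process and itself separated by $a - 1$ time steps. By the
triangle inequality, we can upper bound $\beta^d(a)$ for any $d = d_n$. Let $\hat P$ and $\hat P_{a,d}$ be the distributions
associated with histogram estimators $\hat f^d$ and $\hat f_a^{2d}$ respectively. Then,

$$\begin{aligned}
\beta^d(a) &= \|P \otimes P - P_{a,d}\|_{TV} \\
&= \|P \otimes P - \hat P \otimes \hat P + \hat P \otimes \hat P - \hat P_{a,d} + \hat P_{a,d} - P_{a,d}\|_{TV} \\
&\le \|P \otimes P - \hat P \otimes \hat P\|_{TV} + \|\hat P \otimes \hat P - \hat P_{a,d}\|_{TV} + \|\hat P_{a,d} - P_{a,d}\|_{TV} \\
&\le 2\|P - \hat P\|_{TV} + \|\hat P \otimes \hat P - \hat P_{a,d}\|_{TV} + \|\hat P_{a,d} - P_{a,d}\|_{TV} \\
&= \int |f^d - \hat f^d| + \frac{1}{2}\int |\hat f^d \otimes \hat f^d - \hat f_a^{2d}| + \frac{1}{2}\int |\hat f_a^{2d} - f_a^{2d}|
\end{aligned}$$

where $\frac{1}{2}\int |\hat f^d \otimes \hat f^d - \hat f_a^{2d}|$ is our estimator $\hat\beta^d(a)$ and the remaining terms are the $L^1$ distance between
a density estimator and the target density. Thus,

$$\beta^d(a) - \hat\beta^d(a) \le \int |f^d - \hat f^d| + \frac{1}{2}\int |\hat f_a^{2d} - f_a^{2d}|.$$

A similar argument starting from $\hat\beta^d(a) = \|\hat P \otimes \hat P - \hat P_{a,d}\|_{TV}$ shows that

$$\beta^d(a) - \hat\beta^d(a) \ge -\int |f^d - \hat f^d| - \frac{1}{2}\int |\hat f_a^{2d} - f_a^{2d}|,$$

so we have that

$$\left|\beta^d(a) - \hat\beta^d(a)\right| \le \int |f^d - \hat f^d| + \frac{1}{2}\int |\hat f_a^{2d} - f_a^{2d}|.$$

Therefore,

$$\begin{aligned}
\mathbb{P}\left(\left|\hat\beta^d(a) - \beta^d(a)\right| > \epsilon\right) &\le \mathbb{P}\left(\int |f^d - \hat f^d| + \frac{1}{2}\int |\hat f_a^{2d} - f_a^{2d}| > \epsilon\right) \\
&\le \mathbb{P}\left(\int |f^d - \hat f^d| > \frac{\epsilon}{2}\right) + \mathbb{P}\left(\int |\hat f_a^{2d} - f_a^{2d}| > \epsilon\right) \\
&\le 2\exp\left\{-\frac{\mu_n\epsilon_1^2}{2}\right\} + 2\exp\left\{-\frac{\mu_n\epsilon_2^2}{2}\right\} + 4(\mu_n - 1)\beta(m_n),
\end{aligned}$$

where $\epsilon_1 = \epsilon/2 - \mathbb{E}\left[\int |\hat f^d - f^d|\right]$ and $\epsilon_2 = \epsilon - \mathbb{E}\left[\int |\hat f_a^{2d} - f_a^{2d}|\right]$; the last step applies
Theorem 3.1 to each of the two histograms. □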



The proof of Theorem 2.3 requires two steps which are given in the following lemmas. The first
specifies the histogram bandwidth $h_n$ and the rate at which $d_n$ (the dimensionality of the target
density) goes to infinity. If the dimensionality of the target density were fixed, we could achieve
rates of convergence similar to those for histograms based on IID inputs. However, we wish to allow
the dimensionality to grow with $n$, so the rates are much slower as shown in the following lemma.

Lemma 4.1. For the histogram estimator in Lemma 3.4, let

$$d_n \sim \exp\{W(\log n)\}, \qquad h_n \sim n^{-k_n}$$

with

$$k_n = \frac{W(\log n) + \frac{1}{2}\log n}{\log n\left(\frac{1}{2}\exp\{W(\log n)\} + 1\right)}.$$

These choices lead to the optimal rate of convergence.

Proof. Let $h_n = n^{-k_n}$ for some $k_n$ to be determined. Then we want $n^{-1/2}h_n^{-d_n/2} = n^{(k_nd_n - 1)/2} \to 0$,
$d_nh_n = d_nn^{-k_n} \to 0$, and $d_n^2h_n^2 = d_n^2n^{-2k_n} \to 0$ all as $n \to \infty$. Call these A, B, and C. Taking A and
B first gives

$$\begin{aligned}
n^{(k_nd_n - 1)/2} &\sim d_nn^{-k_n} \\
\Rightarrow\ \tfrac{1}{2}(k_nd_n - 1)\log n &\sim \log d_n - k_n\log n \\
\Rightarrow\ k_n\log n\left(\tfrac{1}{2}d_n + 1\right) &\sim \log d_n + \tfrac{1}{2}\log n \\
\Rightarrow\ k_n &\sim \frac{\log d_n + \frac{1}{2}\log n}{\log n\left(\frac{1}{2}d_n + 1\right)}.
\end{aligned} \tag{9}$$

Similarly, combining A and C gives

$$k_n \sim \frac{2\log d_n + \frac{1}{2}\log n}{\log n\left(\frac{1}{2}d_n + 2\right)}. \tag{10}$$

Equating (9) and (10) and solving for $d_n$ gives $\log n \sim d_n\log d_n$, that is,

$$d_n \sim \exp\{W(\log n)\}$$

where $W(\cdot)$ is the Lambert W function. Plugging back into (9) gives that

$$h_n = n^{-k_n}$$

where

$$k_n = \frac{W(\log n) + \frac{1}{2}\log n}{\log n\left(\frac{1}{2}\exp\{W(\log n)\} + 1\right)}.$$

□

It is also necessary to show that as $d$ grows, $\beta^d(a) \to \beta(a)$. We now prove this result.

Lemma 4.2. $\beta^d(a)$ converges to $\beta(a)$ as $d \to \infty$.

Proof. Because the process is stationary, let $t = 0$ in Definition 2.1 without loss of generality. Let
$\mathbb{P}_{-\infty}^{0}$ be the distribution on $\sigma_{-\infty}^{0} = \sigma(\ldots, X_{-1}, X_0)$, and let $\mathbb{P}_{a}^{\infty}$ be the distribution on $\sigma_{a}^{\infty} =
\sigma(X_a, X_{a+1}, X_{a+2}, \ldots)$. Let $\mathbb{P}_a$ be the distribution on $\sigma = \sigma_{-\infty}^{0} \otimes \sigma_{a}^{\infty}$ (the product sigma-field).
Then we can rewrite Definition 2.1 using this notation as

$$\beta(a) = \sup_{C \in \sigma}\left|\mathbb{P}_a(C) - [\mathbb{P}_{-\infty}^{0} \otimes \mathbb{P}_{a}^{\infty}](C)\right|.$$

Let $\sigma_{-d+1}^{0}$ and $\sigma_{a}^{a+d-1}$ be the sub-$\sigma$-fields of $\sigma_{-\infty}^{0}$ and $\sigma_{a}^{\infty}$ consisting of the $d$-dimensional cylinder
sets for the $d$ dimensions closest together. Let $\sigma^d$ be the product $\sigma$-field of these two. Then we can
rewrite $\beta^d(a)$ as

$$\beta^d(a) = \sup_{C \in \sigma^d}\left|\mathbb{P}_a(C) - [\mathbb{P}_{-\infty}^{0} \otimes \mathbb{P}_{a}^{\infty}](C)\right|. \tag{11}$$

As such $\beta^d(a) \le \beta(a)$ for all $a$ and $d$. We can rewrite (11) in terms of finite-dimensional marginals:

$$\beta^d(a) = \sup_{C \in \sigma^d}\left|\mathbb{P}_{a,d}(C) - [\mathbb{P}_{-d+1}^{0} \otimes \mathbb{P}_{a}^{a+d-1}](C)\right|,$$

where $\mathbb{P}_{a,d}$ is the restriction of $\mathbb{P}$ to $\sigma(X_{-d+1}, \ldots, X_0, X_a, \ldots, X_{a+d-1})$. Because of the nested nature
of these sigma-fields, we have

$$\beta^{d_1}(a) \le \beta^{d_2}(a) \le \beta(a)$$

for all finite $d_1 \le d_2$. Therefore, for fixed $a$, $\{\beta^d(a)\}_{d=1}^{\infty}$ is a monotone increasing sequence which
is bounded above, and it converges to some limit $L \le \beta(a)$. To show that $L = \beta(a)$ requires some
additional steps.

Let $R = \mathbb{P}_a - [\mathbb{P}_{-\infty}^{0} \otimes \mathbb{P}_{a}^{\infty}]$, which is a signed measure on $\sigma$. Let

$$R^d = \mathbb{P}_{a,d} - [\mathbb{P}_{-d+1}^{0} \otimes \mathbb{P}_{a}^{a+d-1}],$$

which is a signed measure on $\sigma^d$. Decompose $R$ into positive and negative parts as $R = Q^+ - Q^-$
and similarly for $R^d = Q^{+d} - Q^{-d}$. Notice that since $R^d$ is constructed using the marginals of $\mathbb{P}$,
then $R(E) = R^d(E)$ for all $E \in \sigma^d$. Now since $R$ is the difference of probability measures, we must
have that

$$\begin{aligned}
0 = R(\Omega) &= Q^+(\Omega) - Q^-(\Omega) \\
&= Q^+(D) + Q^+(D^c) - Q^-(D) - Q^-(D^c)
\end{aligned} \tag{12}$$

for all $D \in \sigma$.

Define $Q = Q^+ + Q^-$. Let $\epsilon > 0$. Let $C \in \sigma$ be such that

$$Q(C) = \beta(a) = Q^+(C) = Q^-(C^c). \tag{13}$$

Such a set $C$ is guaranteed by the Hahn decomposition theorem (letting $C^*$ be a set which attains the
supremum in (11), we can throw away any subsets with negative $R$ measure) and (12) assuming
without loss of generality that $\mathbb{P}_a(C) > [\mathbb{P}_{-\infty}^{0} \otimes \mathbb{P}_{a}^{\infty}](C)$. We can use the field $\sigma_f = \bigcup_d \sigma^d$ to
approximate $\sigma$ in the sense that, for all $\epsilon$, we can find $A \in \sigma_f$ such that $Q(A \Delta C) < \epsilon/2$ (see
Theorem D in Halmos [17, §13] or Lemma A.24 in Schervish [28]). Now,

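$$\begin{aligned}
Q(A \Delta C) &= Q(A \cap C^c) + Q(C \cap A^c) \\
&= Q^-(A \cap C^c) + Q^+(C \cap A^c)
\end{aligned}$$

by (13) since $A \cap C^c \subseteq C^c$ and $C \cap A^c \subseteq C$. Therefore, since $Q(A \Delta C) < \epsilon/2$, we have

$$\begin{aligned}
Q^-(A \cap C^c) &\le \epsilon/2 \\
Q^+(A^c \cap C) &\le \epsilon/2.
\end{aligned} \tag{14}$$

Also,

$$\begin{aligned}
Q(C) &= Q(A \cap C) + Q(A^c \cap C) \\
&= Q^+(A \cap C) + Q^+(A^c \cap C) \\
&\le Q^+(A) + \epsilon/2
\end{aligned}$$

since $A \cap C$ and $A^c \cap C$ are contained in $C$ and $A \cap C \subseteq A$. Therefore

$$Q^+(A) \ge Q(C) - \epsilon/2.$$

Similarly,

$$Q^-(A) = Q^-(A \cap C) + Q^-(A \cap C^c) \le 0 + \epsilon/2 = \epsilon/2$$

since $A \cap C \subseteq C$, $Q^-(C) = 0$ by (13), and $Q^-(A \cap C^c) \le \epsilon/2$ by (14). Finally,

$$\begin{aligned}
Q^{+d}(A) &\ge Q^{+d}(A) - Q^{-d}(A) = R^d(A) \\
&= R(A) = Q^+(A) - Q^-(A) \\
&\ge Q(C) - \epsilon/2 - \epsilon/2 = Q(C) - \epsilon \\
&= \beta(a) - \epsilon.
\end{aligned}$$

And since $\beta^d(a) \ge Q^{+d}(A)$, we have that for all $\epsilon > 0$ there exists $d$ such that for all $d_1 > d$,

$$\beta^{d_1}(a) \ge \beta^d(a) \ge Q^{+d}(A) \ge \beta(a) - \epsilon.$$

Thus, we must have that $L = \beta(a)$, so that $\beta^d(a) \to \beta(a)$ as desired. □

Proof of Theorem 2.3. By the triangle inequality,

$$|\hat\beta^{d_n}(a) - \beta(a)| \le |\hat\beta^{d_n}(a) - \beta^{d_n}(a)| + |\beta^{d_n}(a) - \beta(a)|.$$

The first term on the right is bounded by the result in Theorem 2.4, where we have shown that
$d_n = O(\exp\{W(\log n)\})$ is slow enough for the histogram estimator to remain consistent. That
$\beta^{d_n}(a) \to \beta(a)$ as $d_n \to \infty$ follows from Lemma 4.2. □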

Figure 1: Two-state Markov chain $S_t$ used for simulation results, with transition probabilities
$P_{A,A} = 1/2$, $P_{A,B} = 1/2$, and $P_{B,A} = 1$.

5 Performance in simulations 

To demonstrate the performance of our proposed estimator, we examine its performance in three
simulated examples. The first example is a simple two-state Markov chain. The second example
takes this Markov chain as an unobserved input and outputs a non-Markovian binary sequence
which remains $\beta$-mixing. Finally, we examine an autoregressive model.

As shown in [9], homogeneous recurrent Markov chains are geometrically $\beta$-mixing, i.e. $\beta(a) =
O(\rho^a)$ for some $0 < \rho < 1$. In particular, if the Markov chain has stationary distribution $\pi$ and
$a$-step transition distribution $P^a$, then

$$\beta(a) = \int \pi(dx)\,\|P^a(x) - \pi\|_{TV}. \tag{15}$$

Consider first the two-state Markov chain $S_t$ pictured in Figure 1. By direct calculation using
(15), the mixing coefficients for this process are $\beta(a) = \frac{4}{9}\left(\frac{1}{2}\right)^a$. We simulated chains of length
$n = 1000$ from this Markov chain. Based on 1000 replications, the performance of the estimator
is depicted in Figure 2. Here, we have used two bins in all cases, but we allow the Markov
approximation to vary as $d \in \{1, 2, 3\}$, even though $d = 1$ is exact. The estimator performs well
for $a < 5$, but begins to exhibit a positive bias as $a$ increases. This is because the estimator
is nonnegative, whereas the true mixing rates are quickly approaching zero. The upward bias is
exaggerated for larger $d$. This bias will go to 0 as $n \to \infty$.
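
For a finite-state chain the integral in (15) is a finite sum, so the true coefficients can be evaluated directly. The sketch below does this for the chain of Figure 1 in plain Python; the function names are illustrative assumptions, and the total variation norm is taken as half the $L^1$ distance, matching the convention in the proof of Theorem 2.4.

```python
def mat_mul(p, q):
    """Product of two square matrices stored as nested lists."""
    size = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def beta_markov(transition, stationary, a):
    """Evaluate (15) for a finite chain: sum_x pi(x) * ||P^a(x, .) - pi||_TV.

    The total variation norm is taken as half the L1 distance between
    the a-step transition row and the stationary distribution.
    """
    power = transition
    for _ in range(a - 1):
        power = mat_mul(power, transition)
    return sum(
        pi_x * 0.5 * sum(abs(p_xy - pi_y) for p_xy, pi_y in zip(row, stationary))
        for pi_x, row in zip(stationary, power)
    )


# Two-state chain of Figure 1, states ordered (A, B); pi solves pi P = pi.
P = [[0.5, 0.5], [1.0, 0.0]]
pi = [2 / 3, 1 / 3]
betas = [beta_markov(P, pi, a) for a in range(1, 6)]
```

Under this convention the computed values follow $\frac{4}{9}\left(\frac{1}{2}\right)^a$, halving with each additional lag.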

As an example of a long memory process, we construct, following Weiss [34], a partially
observable Markov process which we refer to as the "even process". Let $X_t$ be the observed sequence
which takes as input the Markov process $S_t$ constructed above. We observe

$$X_t = \begin{cases} 1 & (S_t, S_{t-1}) = (A, B) \text{ or } (B, A) \\ 0 & \text{else.} \end{cases}$$

Since $S_t$ is Markovian, the joint process $(S_t, S_{t-1})$ is as well, so we can calculate its mixing rate
$\beta(a) = \frac{8}{9}\left(\frac{1}{2}\right)^a$. The even process must also be $\beta$-mixing, and at least as fast as the joint process,
since it is a measurable function of a mixing process. However, $X_t$ itself is non-Markovian: sequences
of ones must have even lengths, so we need to know how many ones have been observed to know
whether the next observation can be zero or must be a one. Thus, the true mixing coefficients
are bounded above, though unknown. Using the same procedure as above, Figure 3 shows the
estimated mixing coefficients. Again we observe a bias for large $a$ due to the nonnegativity of the
estimator.
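
The construction of the even process from the hidden chain is short enough to write out. The following sketch uses hypothetical helper names and a hand-written state path that is valid for the chain in Figure 1; with only two states, the event $(S_t, S_{t-1}) \in \{(A, B), (B, A)\}$ is the same as a change of state between times $t - 1$ and $t$.

```python
def even_process(states):
    """Map a state path S_0, S_1, ... to X_1, X_2, ... with X_t = 1 iff S_t != S_{t-1}.

    With only two states, (S_t, S_{t-1}) in {(A, B), (B, A)} is the same
    event as a change of state between times t-1 and t.
    """
    return [1 if cur != prev else 0 for prev, cur in zip(states, states[1:])]


def run_lengths_of_ones(x):
    """Lengths of the maximal runs of ones in a binary sequence."""
    runs, current = [], 0
    for value in x:
        if value == 1:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


# Usage: a hand-written path that is valid for the chain in Figure 1 (B always returns to A).
S = list("AABAABABAA")
X = even_process(S)
```

Every completed run of ones in `X` has even length, because a move from A to B is always followed immediately by the forced return from B to A. A run cut off by the end of the sample can be odd.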

Finally, we estimate the $\beta$-mixing coefficients for an AR(1) model

$$Z_t = 0.5Z_{t-1} + \eta_t, \qquad \eta_t \sim N(0, 1).$$

While this process is Markovian, there is no closed form solution to (15), so we calculate it via
numerical integration. Figure 4 shows the performance of the estimator for $d = 1$. We select the
bandwidth $h$ for each $a$ by minimizing the expectation in (17), which we calculate based on
independent simulations from the process. Figure 4
shows the performance for $n = 3000$. The optimal number of bins is 33, 11, 7, 5, and 3 for
$a = 1, \ldots, 5$ and 1 for $a > 5$. However, since the use of one bin corresponds to an estimate of zero,
the figure plots the estimate with two bins. Using two bins, we again see the positive bias for $a > 5$.

Figure 2: This figure illustrates the performance of our proposed estimator for the two-state Markov
chain depicted in Figure 1. We simulated length $n = 1000$ chains and calculated $\hat\beta^d(a)$ for $d = 1$
(circles), $d = 2$ (triangles), and $d = 3$ (squares). The dashed line indicates the true mixing
coefficients. We show means and 95% confidence intervals based on 1000 replications.

Figure 3: This figure illustrates the performance of our proposed estimator for the even process.
Again, we simulated length $n = 1000$ chains and calculated $\hat\beta^d(a)$ for $d = 1$ (circles), $d = 2$
(triangles), and $d = 3$ (squares). The dashed line indicates an upper bound on the true mixing
coefficients. We show means and 95% confidence intervals based on 1000 replications.

Figure 4: This figure illustrates the performance of our proposed estimator for the AR(1) model.
We simulated length $n = 3000$ chains and calculated $\hat\beta^d(a)$ for $d = 1$. The dashed line indicates the
true mixing coefficients calculated via numerical integration. We show means and 95% confidence
intervals based on 1000 replications.
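
One way the numerical integration of (15) could be carried out for this AR(1) model is sketched below. It is an illustration under assumptions, not the authors' procedure: it relies on the standard Gaussian AR(1) facts that the stationary law is $N(0, 1/(1 - \phi^2))$ and that $Z_{t+a}$ given $Z_t = x$ is $N(\phi^a x, (1 - \phi^{2a})/(1 - \phi^2))$, and it uses a simple midpoint rule on a truncated grid. The names `normal_pdf` and `beta_ar1` and the grid settings are hypothetical.

```python
import math


def normal_pdf(y, mean, var):
    """Density of N(mean, var) at y."""
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def beta_ar1(a, phi=0.5, grid=201, width=6.0):
    """Midpoint-rule approximation of (15) for Z_t = phi * Z_{t-1} + eta_t, eta_t ~ N(0, 1).

    Uses the stationary law N(0, 1 / (1 - phi^2)) and the a-step
    conditional law N(phi^a * x, (1 - phi^(2a)) / (1 - phi^2)).
    """
    var_stat = 1.0 / (1.0 - phi ** 2)
    var_cond = (1.0 - phi ** (2 * a)) * var_stat
    half = width * math.sqrt(var_stat)  # integrate over +/- `width` stationary sds
    step = 2 * half / grid
    points = [-half + (k + 0.5) * step for k in range(grid)]
    stat = [normal_pdf(y, 0.0, var_stat) for y in points]

    total = 0.0
    for x, weight in zip(points, stat):
        mean = phi ** a * x
        # Total variation distance as half the L1 distance between densities.
        l1 = sum(abs(normal_pdf(y, mean, var_cond) - s) for y, s in zip(points, stat)) * step
        total += weight * 0.5 * l1 * step
    return total


betas = [beta_ar1(a) for a in range(1, 6)]
```

The resulting values decrease with the lag $a$, as expected for a geometrically mixing process; a finer grid or an adaptive quadrature rule would be a natural refinement if more digits were needed.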



6 Discussion 

We have shown that our estimator of the $\beta$-mixing coefficients is consistent for the true coefficients
$\beta(a)$ under some conditions on the data generating process. There are numerous results in the
statistics literature which assume knowledge of the $\beta$-mixing coefficients, yet as far as we know,
this is the first estimator for them. An ability to estimate these coefficients will allow researchers
to apply existing results to dependent data without the need to arbitrarily assume their values.
Additionally, it will allow probabilists to recover unknown mixing coefficients for stochastic
processes via simulation. Despite the obvious utility of this estimator, as a consequence of its novelty,
it comes with a number of potential extensions which warrant careful exploration as well as some
drawbacks.

Several other mixing and weak-dependence coefficients also have a total-variation flavor, perhaps
most notably $\alpha$-mixing [12, 10, 6]. None of them have estimators, and the same trick might well
work for them, too.

The reader will note that Theorem 2.3 does not provide a convergence rate. The rate in
Theorem 2.4 applies only to the difference between $\hat\beta^d(a)$ and $\beta^d(a)$. In order to provide a rate
in Theorem 2.3, we would need a better understanding of the non-stochastic convergence of $\beta^d(a)$
to $\beta(a)$. It is not immediately clear that this quantity can converge at any well-defined rate. In
particular, it seems likely that the rate of convergence depends on the tail of the sequence $\{\beta(a)\}_{a=1}^{\infty}$.

The use of histograms rather than kernel density estimators for the joint and marginal densities
is surprising and perhaps not ultimately necessary. As mentioned above, Tran [31] proved that
KDEs are consistent for estimating the stationary density of a time series with $\beta$-mixing inputs, so
perhaps one could replace the histograms in our estimator with KDEs. However, this would need
an analogue of the double asymptotic results proven for histograms in Lemma 3.4. In particular,
we need to estimate increasingly higher dimensional densities as $n \to \infty$. This does not cause a
problem of small-$n$-large-$d$ since $d$ is chosen as a function of $n$; however, it will lead to increasingly
higher dimensional integration. For histograms, the integral is always trivial, but in the case of
KDEs, the numerical accuracy of the integration algorithm becomes increasingly important. This
issue could swamp any efficiency gains obtained through the use of kernels. However, this question
certainly warrants further investigation.

The main drawback of an estimator based on a density estimate is its complexity. The mixing
coefficients are functionals of the joint and marginal distributions derived from the stochastic
process $\mathbf{X}$; however, it is unsatisfying to estimate densities and solve integrals in order to estimate
a single number. Vapnik's main principle for solving problems using a restricted amount of
information is "When solving a given problem, try to avoid solving a more general problem as an
intermediate step [33, p. 30]." However, despite our estimator's complexity, we are able to obtain
nearly parametric rates of convergence to the Markov approximation departing only by logarithmic
factors. While the simplicity principle is clearly violated, perhaps our seed will precipitate a more
aesthetically pleasing solution.